Elizaveta Semenova

Applied Machine Learning Days 2020

Bayesian Inference: embracing uncertainty.

If anything remains unclear or unanswered after this session, or if you spot mistakes or typos, contact:

  • Twitter: @liza_p_semenova

  • Github: elizavetasemenova

  • E-mail: elizaveta.p.semenova@gmail.com

Overview:

Julia:

  • Why Julia?
  • Basic syntax
  • How to continue learning Julia after this workshop

Bayesian Inference:

  • What is Bayesian inference?
  • Why do we care?
  • How to perform Bayesian inference: a painful way and a modern way
  • Examples

Outro

  • Bayesian Neural Networks
  • Bayesian decision theory
  • Closing remarks

We will use Julia for the practical part:

In [1]:
println("All models are wrong, but some are useful.")
All models are wrong, but some are useful.

Python and R versions of the Bayesian block are also available in this repository.

Julia

Why Julia

  • Python and R are easy to learn but cannot match the speed of native code.
  • Julia (first released in 2012) was created to overcome this issue.
  • Julia is not meant to substitute the two established GPPLs (general purpose programming languages), but rather to augment them, since a lot of R and Python functionality is available from Julia.
  • Julia's popularity is growing in the scientific community: https://www.nature.com/articles/d41586-019-02310-3

  • Julia provides fertile ground for the development of probabilistic programming languages (PPLs): several native ones have been developed recently (Turing, Gen, Stheno, SOSS, Omega), indicating strong interest from the Bayesian research community.

Basic Julia syntax

In [2]:
##############################################
# basic Julia syntax
##############################################
In [3]:
println("Hello, world!")
Hello, world!
In [4]:
3 - 2.4
Out[4]:
0.6000000000000001
In [5]:
println("computed ", 3 - 4.5)
computed -1.5
In [6]:
println("Hello" * " Julia!")
Hello Julia!
In [7]:
" who" ^ 3 * ", who let the dogs out" # do you know this song?
Out[7]:
" who who who, who let the dogs out"

Arithmetic operations

  • x + y binary plus
  • x - y binary minus
  • x * y times
  • x / y divide
  • x ÷ y integer divide
  • x \ y inverse divide
  • x ^ y power
  • x % y remainder
In [8]:
println((50 / 60) ^ 2)
0.6944444444444445
In [9]:
x = 9
y = 7
x % y
Out[9]:
2
In [10]:
rem(9, 7)
Out[10]:
2
In [11]:
x ÷ y
Out[11]:
1
In [12]:
x \ y
Out[12]:
0.7777777777777778

Logical expressions

  • ! is the NOT operator
  • && is the AND operator
  • || is the OR operator

Beware:

  • & is the "bitwise AND"
  • | is "bitwise OR"
In [13]:
false      &&   true       ||  true
Out[13]:
true

Julia allows one to omit the multiplication operator when writing expressions.

In [14]:
x = 3
4x^2+x
Out[14]:
39
In [15]:
4+1<2||2+2<1
Out[15]:
false
In [16]:
a=true
b=!!!!a
Out[16]:
true

Greek letters:

Type '\theta' and press Tab.

This allows us to copy math equations directly, as if we were writing them by hand.

In [17]:
θ = 3
Out[17]:
3
In [18]:
2θ
Out[18]:
6

Type system

In [19]:
typeof(1)
Out[19]:
Int64
In [20]:
typeof("Hello, world!")
Out[20]:
String
In [21]:
typeof("αβγδ")
Out[21]:
String
In [22]:
typeof("H")
Out[22]:
String
In [23]:
typeof('H')
Out[23]:
Char
In [24]:
typeof(1+3+5.)
Out[24]:
Float64

Array types

In [25]:
Array{Int64}(undef, 3)
Out[25]:
3-element Array{Int64,1}:
 4565960688
 4369189104
 4565959232
In [26]:
Array{String}(undef, 3)
Out[26]:
3-element Array{String,1}:
 #undef
 #undef
 #undef
In [27]:
'a'==="a"
Out[27]:
false
In [28]:
x=3
y=2
a = Array{Integer,2}(undef, x, y)
Out[28]:
3×2 Array{Integer,2}:
 #undef  #undef
 #undef  #undef
 #undef  #undef
In [29]:
x = Array{Int64}(undef,11, 12) 
typeof(x)
Out[29]:
Array{Int64,2}

Functions

In [30]:
a, b, c = cos(0.2), log(10), abs(-1.22)  # multiple assignment
Out[30]:
(0.9800665778412416, 2.302585092994046, 1.22)
In [31]:
myfunc(x) = 20*x
Out[31]:
myfunc (generic function with 1 method)
In [32]:
myfunc(2)
Out[32]:
40
In [33]:
add(x,y) = x + y
Out[33]:
add (generic function with 1 method)
In [34]:
add(33, -22.2)
Out[34]:
10.8
In [35]:
function nextfunc(a, b, c) 
    a*b + c                
end
Out[35]:
nextfunc (generic function with 1 method)
In [36]:
nextfunc(7,5,3)
Out[36]:
38
In [37]:
function print_type(x)
    println("The type of testvar is $(typeof(x)) and the value of testvar is $x")
end
Out[37]:
print_type (generic function with 1 method)
In [38]:
a = ['1',2.]
print_type(a)
The type of testvar is Array{Any,1} and the value of testvar is Any['1', 2.0]
In [39]:
function f(x)
  return 2x
  3x
end

f(5)
Out[39]:
10

Multiple methods

In [40]:
cos_func(x) = cos(x)
Out[40]:
cos_func (generic function with 1 method)
In [41]:
cos_func(.7)
Out[41]:
0.7648421872844885
In [42]:
cos_func(adj, hyp) = adj/hyp
Out[42]:
cos_func (generic function with 2 methods)
In [43]:
cos_func(.7)
Out[43]:
0.7648421872844885
In [44]:
cos_func(12, 13)
Out[44]:
0.9230769230769231
In [45]:
methods(cos_func)
Out[45]:
2 methods for generic function cos_func:
  • cos_func(x) in Main at In[40]:1
  • cos_func(adj, hyp) in Main at In[42]:1
In [46]:
?cos_func
search: cos_func

Out[46]:

No documentation found.

cos_func is a Function.

# 2 methods for generic function "cos_func":
[1] cos_func(x) in Main at In[40]:1
[2] cos_func(adj, hyp) in Main at In[42]:1
In [47]:
cos_func(theta::Float64) = cos(theta)   # :: forces Julia to check the type
Out[47]:
cos_func (generic function with 3 methods)

Use '?' to search the documentation. For example, '?cos':

In [48]:
?cos
search: cos cosh cosd cosc cospi cos_func acos acosh acosd sincos const close

Out[48]:
cos(x)

Compute cosine of x, where x is in radians.


cos(A::AbstractMatrix)

Compute the matrix cosine of a square matrix A.

If A is symmetric or Hermitian, its eigendecomposition (eigen) is used to compute the cosine. Otherwise, the cosine is determined by calling exp.

Examples

jldoctest
julia> cos(fill(1.0, (2,2)))
2×2 Array{Float64,2}:
  0.291927  -0.708073
 -0.708073   0.291927

Working with arrays

In [49]:
x = [1, 6, 2, 4, 7, 2, 76, 5]
Out[49]:
8-element Array{Int64,1}:
  1
  6
  2
  4
  7
  2
 76
  5
In [50]:
size(x)
Out[50]:
(8,)
In [51]:
length(x)
Out[51]:
8
In [52]:
typeof(x)
Out[52]:
Array{Int64,1}
In [53]:
x = collect(1:7)
Out[53]:
7-element Array{Int64,1}:
 1
 2
 3
 4
 5
 6
 7
In [54]:
x = range(0, stop = 1, length = 10)
Out[54]:
0.0:0.1111111111111111:1.0
In [55]:
x = collect(x)
Out[55]:
10-element Array{Float64,1}:
 0.0               
 0.1111111111111111
 0.2222222222222222
 0.3333333333333333
 0.4444444444444444
 0.5555555555555556
 0.6666666666666666
 0.7777777777777778
 0.8888888888888888
 1.0               
In [56]:
x = fill(5, 4)
Out[56]:
4-element Array{Int64,1}:
 5
 5
 5
 5

Vectorized "dot" operators

In [57]:
x + 2
MethodError: no method matching +(::Array{Int64,1}, ::Int64)
Closest candidates are:
  +(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:502
  +(!Matched::Complex{Bool}, ::Real) at complex.jl:292
  +(!Matched::Missing, ::Number) at missing.jl:93
  ...

Stacktrace:
 [1] top-level scope at In[57]:1
In [58]:
x .+ 2
Out[58]:
4-element Array{Int64,1}:
 7
 7
 7
 7
In [59]:
x .* x
Out[59]:
4-element Array{Int64,1}:
 25
 25
 25
 25
In [60]:
data = [1.6800483  -1.641695388; 
        0.501309281 -0.977697538; 
        1.528012113 0.52771122;
        1.70012253 1.711524991; 
        1.992493625 1.891000015]
Out[60]:
5×2 Array{Float64,2}:
 1.68005   -1.6417  
 0.501309  -0.977698
 1.52801    0.527711
 1.70012    1.71152 
 1.99249    1.891   
In [61]:
size(data)
Out[61]:
(5, 2)
In [62]:
typeof(data)
Out[62]:
Array{Float64,2}
In [63]:
data[:,1] # all rows, 1st column
Out[63]:
5-element Array{Float64,1}:
 1.6800483  
 0.501309281
 1.528012113
 1.70012253 
 1.992493625
In [64]:
data = [[3,2,1] [3,2,1] [3,2,1] [3,2,1] [3,2,1] [3,2,1] [3,2,1] [3,2,1] [6,5,4]]
Out[64]:
3×9 Array{Int64,2}:
 3  3  3  3  3  3  3  3  6
 2  2  2  2  2  2  2  2  5
 1  1  1  1  1  1  1  1  4
In [65]:
data[2,:]
Out[65]:
9-element Array{Int64,1}:
 2
 2
 2
 2
 2
 2
 2
 2
 5
In [66]:
rows, cols = size(data)
println(rows)
println(cols)
3
9

Loops

In [67]:
for num = 1:2:length(x)
    println("num is now $num")
end
num is now 1
num is now 3
In [68]:
values = [ "first element", 'θ' , 42]      # an array

for x in values    
    println("The value of x is now $x")
end
The value of x is now first element
The value of x is now θ
The value of x is now 42
In [69]:
x = [3, 2, 1]

count=1
for i in x
  x[i]=count
  count=count+1
end

println(x[3])
println(x)
1
[3, 2, 1]

"If" case

if condition

...

end

In [70]:
if 5>3
    println("test passed")
end
test passed

Plots

In [71]:
using Plots
In [72]:
gr()
Out[72]:
Plots.GRBackend()
In [73]:
x = cos.( - 10:0.1:10)
plot(1:length(x), x)
Out[73]:
[plot: cosine values against index]
In [74]:
plot!(title = "First plot", size = [300, 300], legend = false) # modify existing plot
Out[74]:
[plot: "First plot"]
In [75]:
hline!([0])
Out[75]:
[plot: "First plot" with a horizontal line at 0]
In [76]:
plot(1:length(x), x, 
    size = [300, 300], 
    legend=false, 
    line= :scatter,
    marker = :hex)
Out[76]:
[scatter plot of the cosine values]
In [77]:
x = collect(1:0.1:7)
f(x) = 2 - 2x + x^2/4 + sin(2x)
plot(x, f)
Out[77]:
[plot of f(x) on the interval [1, 7]]

Save plot

In [78]:
#savefig("Plot_name.png")                           # Saved png format

Distributions.jl

In [79]:
using Distributions
In [80]:
d = Normal()
Out[80]:
Normal{Float64}(μ=0.0, σ=1.0)
In [81]:
n_draws = 1000
draws = rand(d, n_draws)
histogram(draws, size = [300, 300], bins = 40, leg = false, norm = true)
Out[81]:
[histogram of the draws]
In [82]:
[pdf(d, x) for x in -3:1:3] # evaluate PDF/PMF
Out[82]:
7-element Array{Float64,1}:
 0.0044318484119380075
 0.05399096651318806  
 0.24197072451914337  
 0.3989422804014327   
 0.24197072451914337  
 0.05399096651318806  
 0.0044318484119380075
In [83]:
shape = (3, 2)
rand(Uniform(0,1), shape)
Out[83]:
3×2 Array{Float64,2}:
 0.10616   0.115299 
 0.982672  0.0514775
 0.867987  0.673957 

Please run these commands now:

In [84]:
using Distributions
using Plots
using StatsBase
using Turing

Bayesian Inference

WHAT

What is Bayesian inference?

Bayesian inference is the process of deducing properties of a probability distribution from data using Bayes’ theorem.

Bayes' theorem relies heavily on the notion of conditional probability.

Conditional probability

Conditional probability of A given B:

$$P(A \vert B)=\frac{P(A \cap B)}{P(B)}.$$

$P(A \vert B)$ - probability of event A given that B has occurred
$P(A \cap B)$ - probability that both A and B occurred

We can derive

$$P(A \vert B)= P(B \vert A) \frac{P(A)}{P(B)}$$

The conditional probability of A given B is the conditional probability of B given A scaled by the relative probability of A compared to B.

Bayes' Theorem example 1: diagnosing flu

Find the probability of a patient having the flu given that they have a high temperature.

A = having the flu

B = having a high fever

$$P(\text{flu} \vert \text{fever})= P(\text{fever} \vert \text{flu}) \frac{P(\text{flu})}{P(\text{fever})}$$

We know that the probability of having a fever this time of year is 10%, the probability of having the flu this time of year is 7%, and among all people who have the flu, 70% have a fever.

$$ 70\% * 7\% / 10\% = 49\%.$$
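As a quick sanity check, the same arithmetic in Julia (the numbers are those given in the text above):

```julia
# Bayes' theorem for the flu example:
# P(flu | fever) = P(fever | flu) * P(flu) / P(fever)
p_fever_given_flu = 0.70   # among people with the flu, 70% have a fever
p_flu             = 0.07   # probability of having the flu this time of year
p_fever           = 0.10   # probability of having a fever this time of year

p_flu_given_fever = p_fever_given_flu * p_flu / p_fever
println(p_flu_given_fever)   # ≈ 0.49
```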

Bayes' Theorem example 2: spam filtering

We would like to test whether a message is spam by checking whether it contains a predefined set of words or word combinations that are common in spam messages ("Free investment", "Dear email"):


$$P(\text{spam } \vert \text{ words}) = \frac{ P(\text{words } \vert \text{ spam}) P(\text{spam})}{P(\text{words})}$$

In practice, we usually want to estimate the unknown model parameters $$\theta = (\theta_1, ..., \theta_k),$$ given the observed data $$y = (y_1, ..., y_n),$$ using Bayes' rule: $$P(\theta \vert y) = \frac{P(y \vert \theta) P(\theta)}{P(y)}.$$

In the continuous case, we speak of probability densities $f(\cdot)$.

The posterior distribution is formed as

$$f(\theta \vert y) \propto f(y \vert \theta) f(\theta).$$

  • $f(\theta)$ is the prior and expresses our beliefs about $\theta$

  • $f(y \vert \theta)$ is the likelihood of data $y$ as a function of $\theta$

  • we dropped $f(y)$ because the average likelihood across all parameter values $$f(y) = \int_{\theta_1} ... \int_{\theta_k} f(\theta) f(y \vert \theta) d\theta_1 ... d\theta_k = E_{\theta}f(y \vert \theta)$$ does not depend on $\theta$ (hence will not influence the inference) and is very hard to compute.

Components of the Bayesian model


  • Likelihood

  • Prior

  • Posterior

$$\text{Posterior} = \frac{\text{Likelihood × Prior}}{\text{Average Likelihood}} \propto \text{Likelihood × Prior}$$

WHY

Diagnosing the cause of headache

Frequentist doctor:

  • A mental model for the cause of pain
  • Perform tests

Bayesian doctor:

  • A mental model for the cause of pain
  • History of the patient
  • Perform tests

I.e., frequentist inference is like a doctor who never reads the patient's history, even for patients with chronic conditions.

Frequentist vs Bayesian approaches

Frequentist:

  • probability is a long-run frequency
  • parameters are fixed,
  • parameters can only be estimated using the current data.

Bayesian:

  • probability represents one's uncertainty
  • parameters are described by distributions (i.e. reflecting uncertainty),
  • estimation is based on our belief and/or prior knowledge about parameters and the current data.

Remark: historically, the Bayesian approach appeared earlier than the frequentist one.

Why have we been using frequentist methods all these years?

Because we didn't have computers! Bayesian inference is very hard to carry out by hand.

Why use Bayesian inference?


  • Prior knowledge

  • Flexible uncertainty modeling, particularly when little data is available

  • Very flexible model formulation accounting for the mechanistic knowledge about a system

Note:

  • With enough data, the results are essentially the same, i.e. the posterior mode approaches the MLE

Exercise

  • think of a problem relevant to your work where one could benefit from the Bayesian approach

HOW


What does it take?

  • Data,
  • A generative model,
  • Our beliefs before seeing the data.

What does it make?

  • The values of parameters that could give rise to the observed data in the form of a distribution.

How

  • analytically

    elegant, but rarely possible.

  • numerically

    • Create posterior samples, describing the distributions of parameters.

    • We achieve this by exploring the space of parameters to find the most probable combinations of parameters.

    • Further, we treat the obtained samples as new data, to extract information about parameters, such as the mean, a credible interval, or other statistics.
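That last step can be sketched directly; a stand-in vector of "posterior draws" is used here so the snippet is self-contained:

```julia
using Statistics

# stand-in posterior draws, for illustration only
draws = 0.6 .+ 0.05 .* randn(10_000)

post_mean = mean(draws)                  # posterior mean estimate
ci95 = quantile(draws, [0.025, 0.975])   # central 95% credible interval
println(post_mean, " ", ci95)
```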

Analytically

Numerically

  • Markov Chain Monte Carlo (MCMC) family of algorithms, e.g.
    • Metropolis-Hastings
    • Gibbs
    • Hamiltonian Monte Carlo (HMC)
    • No-U-Turn sampler (NUTS)
    • further variants such as SGHMC, LDHMC, etc
  • Variational Bayes
  • Approximate Bayesian Computation (ABC)
  • Particle filters
  • Laplace approximation

Priors


  • informative / non-informative

  • conjugate priors guarantee that the posterior has an easily calculable form

Some conjugate families of distributions (Robert, Casella, "Monte Carlo Statistical Methods")
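A minimal sketch of conjugacy with the Beta-Binomial pair (the hyperparameter and data values below are illustrative): a Beta(a, b) prior combined with h successes in n Binomial trials yields a Beta(a + h, b + n - h) posterior in closed form, with no sampling required.

```julia
using Distributions

a, b = 2, 2     # illustrative prior hyperparameters
n, h = 10, 7    # illustrative data: 7 successes in 10 trials

posterior = Beta(a + h, b + n - h)   # conjugate update in closed form
println(mean(posterior))             # posterior mean = (a + h) / (a + b + n)
```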

Example - Priors

In [85]:
using Distributions
using Plots
using StatsBase
In [86]:
##############################################
# prior x likelihood = posterior
##############################################

success=6

tosses=9

# Create a distribution with n = 9 (e.g. tosses) and p = 0.5.

d = Binomial(tosses, 0.5)
pdf(d, success)
Out[86]:
0.1640625000000001
In [87]:
# define grid
grid_points = 100
p_grid =  range(0, stop = 1, length = grid_points)

# compute likelihood at each point in the grid
likelihood = [pdf(Binomial(tosses, p), success) for p in p_grid]

# define prior
prior = ones(length(p_grid));
In [88]:
# As Uniform prior has been used, unstandardized posterior is equal to likelihood

# compute product of likelihood and prior
posterior = likelihood .* prior;
In [89]:
function computePosterior(likelihood, prior)
   
    # compute product of likelihood and prior
    unstd_posterior = likelihood .* prior

    # standardize posterior
    posterior = unstd_posterior / sum(unstd_posterior)
    
    p1 = plot(p_grid, prior, title = "Prior")
    p2 = plot(p_grid, likelihood , title = "Likelihood")
    p3 = plot(p_grid, posterior, title = "Posterior")
    
    plot(p1, p2, p3, layout=(1, 3), label="")

end
Out[89]:
computePosterior (generic function with 1 method)
In [90]:
prior1 = ones(length(p_grid))
posterior1 = computePosterior(likelihood, prior1)
Out[90]:
[plots: Prior, Likelihood, Posterior]

With a flat prior, the posterior has the same shape as the likelihood.

In [91]:
#prior2 = 2 * (p_grid .>= 0.5)
prior2 = 0.5 * (p_grid .>= 0.5)
posterior2 = computePosterior(likelihood, prior2)
Out[91]:
[plots: Prior, Likelihood, Posterior]

The posterior vanishes at the points where the prior vanishes. I.e., we need to be careful and avoid assigning zero prior probability to regions where we have no evidence that those values are impossible.

In [92]:
prior3 = exp.(-5 * abs.(p_grid .- 0.5))
posterior3 = computePosterior(likelihood, prior3)
Out[92]:
[plots: Prior, Likelihood, Posterior]

Some priors, such as the exponential and Laplace, allow one to obtain a "shrinkage" effect.

The Monte Carlo method

is a group of algorithms that use repeated random sampling to numerically estimate unknown quantities/parameters.

(originated within the Manhattan Project thanks to Stanislaw Ulam)

Find approximate value of $\pi.$

(Note: we are solving a deterministic problem using random number generation)

In [93]:
##############################################
# the Monte Carlo method - compute pi
##############################################
In [94]:
function in_circle(x, y, r)
    sqrt(x^2 + y^2) <= r
end
Out[94]:
in_circle (generic function with 1 method)
In [95]:
function approx_pi(r, n)
    
    xs, ys, cols = [], [], []
    
    count = 0

    for i in range(1, step=1, stop=n)
        x = rand(Uniform(0,1))
        y = rand(Uniform(0,1))
        append!(xs, x)
        append!(ys, y)

        if in_circle(x, y, r)
            count += 1
            cols = vcat(cols, :red)
        else
            cols = vcat(cols, :steelblue)
        end
    end

    pi_appr = round(4 * count/n, digits = 3)
    
    pl = scatter(xs, 
        ys, 
        color=cols, 
        size=(200,200),
        legend = false,
        xticks = false,
        yticks = false,
        framestyle = :box,
        title = "pi (approximately) = " * string(pi_appr),
        titlefontsize=font(7, "Calibri"))
    
    display(pl)
    
end
Out[95]:
approx_pi (generic function with 1 method)
In [96]:
r = 1
n = 100

for n in 5 * 10 .^[1, 2, 3]
    approx_pi(r, n)
end
pi (approximately) = 3.28
pi (approximately) = 3.072
pi (approximately) = 3.142

Monte Carlo integration

(Again, let's solve a deterministic problem using random sampling)

Find value of the integral

$$\int_a^b f(x)dx. $$

Monte Carlo integration estimates this integral by estimating the fraction of random points that fall below $f(x)$.

In our context, we are interested in estimating expectations

$$ E[h(x)] = \int h(x)f(x)dx,$$

which can be done with

$$ \bar{h}_n = \frac{1}{n} \sum_i^n h(x_i),$$

where $x_i ∼ f$ is a draw from the density $f$.

The error of Monte Carlo integration decreases as $O(n^{-1/2})$, independently of the dimensionality. Hence Monte Carlo integration generally beats numerical integration (quadrature) for moderate- and high-dimensional integrals, since the error of a quadrature rule of order $k$ decreases only as $O(n^{-k/d})$ in $d$ dimensions.
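The sample-mean estimator $\bar{h}_n$ above can be sketched directly. For instance, with $h(x) = e^x$ and $f$ the Uniform(0, 1) density, it estimates $\int_0^1 e^x dx = e - 1$:

```julia
using Distributions, Statistics

n = 10^6
xs = rand(Uniform(0, 1), n)   # draws x_i ~ f
est = mean(exp.(xs))          # sample-mean Monte Carlo estimate of E[h(x)]
println(est)                  # close to e - 1 ≈ 1.71828
```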

Example

Estimate the integral $\int_0^1 e^x dx$.

In [97]:
##############################################
# the Monte Carlo method - integration
##############################################
In [98]:
exp(1) - exp(0)
Out[98]:
1.718281828459045
In [99]:
x = range(0, stop = 1, length = 100)
plot(x, exp.(x), size= [200,200], legend= false)

pts =  rand(Uniform(0,1), (100, 2)) # sample uniformly in the square
pts[:, 2] *= exp(1)

cols = fill(:steelblue, 100)

for i in range(1, step=1, stop=100)
    if pts[i,2] > exp(pts[i,1])     # acceptance / rejection step
        cols[i] = :red
    end
end

scatter!(pts[:, 1], pts[:, 2], color = cols, size=[250, 250], legend = false, xlim = [0,1], ylim = [0, exp(1)])
Out[99]:
[scatter plot of points above/below the curve exp(x)]
In [100]:
# Monte Carlo approximation

for n in 10 .^[1, 2, 3, 4, 5, 6, 7, 8]
    pts =  rand(Uniform(0,1), (n, 2))
    pts[:, 2] *= exp(1)
    count = sum(pts[:, 2] .< exp.(pts[:, 1]))
    volume = exp(1) * 1 # volume of region
    sol = (volume * count)/n    
    println(sol)
end
1.3591409142295225
1.4678721873678844
1.7152358337576574
1.7549227484531595
1.7204277520500142
1.7188538668713365
1.7179524846170195
1.7184223939967052

Mandatory coin tossing example

Is the coin biased if we observed (H, H, T, H, ... , T, H)?

The frequentist way - analytically

How to derive the MLE estimate?

$$L(\theta) = \theta^{\text{heads}} (1-\theta)^{\text{tails}}$$

$$\log L(\theta) = \text{heads} \cdot \log \theta + \text{tails} \cdot \log (1-\theta)$$

$$\frac{d \log L}{d\theta} = \frac{\text{heads}}{\theta} - \frac{\text{tails}}{1-\theta} = 0$$

and so

$$ \hat{\theta} = \frac{\text{heads}}{\text{heads + tails}} $$
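A quick numeric check of the closed form, with illustrative counts, by maximizing the log-likelihood over a grid:

```julia
heads, tails = 61, 39            # illustrative counts
loglik(θ) = heads * log(θ) + tails * log(1 - θ)

θ̂ = heads / (heads + tails)     # closed-form MLE
grid = 0.001:0.001:0.999
θ_grid = grid[argmax(loglik.(grid))]
println(θ̂, " ", θ_grid)         # both ≈ 0.61
```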

The frequentist way - numerically

In [101]:
##############################################
# coin tossing
##############################################
In [102]:
n = 4
h = 3
p = h/n
Out[102]:
0.75

The Bayesian way

Beta distribution:

$$ \text{Beta}_\theta(a,b) = C * \theta^{(a-1)} (1 - \theta)^{(b-1)} $$

If we compute the posterior distribution analytically, we will obtain a Beta distribution with new parameters.

In [103]:
a, b = 10, 10                   # hyperparameters
prior = Beta(a, b)              # prior
post = Beta(h+a, n-h+b)         # posterior
Out[103]:
Beta{Float64}(α=13.0, β=11.0)
In [104]:
function beta_binomial(n, h, a, b)
    # frequentist
    p = h/n
    mu = mean(Binomial(n, p))
    
    # Bayesian
    thetas = range(0, stop=1, length=200)
    prior = pdf.(Beta(a, b), thetas)

    post = pdf.(Beta(h+a, n-h+b), thetas)
    
    likelihood = n * [pdf(Binomial(n, p), h) for p in thetas];
    plot(thetas, 
         prior, 
         size= [400, 400], 
         label = "Prior",
         color = :blue,
         xlim = [0, 1],
         xlabel = "theta",
         ylabel = "Density")
    plot!(thetas, post, label = "Posterior", color = :red)
    plot!(thetas, likelihood, label="Likelihood", color = :green, legend = :topleft)
    vline!([(h+a-1)/(n+a+b-2)], color = :red, linestyle = :dash, label="MAP")
    vline!([mu / n], color = :green, linestyle = :dash, label="MLE")
 
end
Out[104]:
beta_binomial (generic function with 1 method)
In [105]:
beta_binomial(100, 80, 10, 10)
Out[105]:
[plot: Prior, Posterior, Likelihood, MAP, MLE]
In [106]:
beta_binomial(4, 3, 10, 10)
Out[106]:
[plot: Prior, Posterior, Likelihood, MAP, MLE]
In [107]:
beta_binomial(4, 3, 2, 2)
Out[107]:
[plot: Prior, Posterior, Likelihood, MAP, MLE]
In [108]:
beta_binomial(4, 3, 1, 1)
Out[108]:
[plot: Prior, Posterior, Likelihood, MAP, MLE]

Markov Chain Monte Carlo

  • We want to estimate the posterior distribution, but it is often intractable

Markov Chain Monte Carlo idea:

  • draw samples from a (simple) proposal distribution so that each draw depends only on the previous draw (i.e. the samples form a Markov chain)
  • under certain conditions, the Markov chain will have a unique stationary distribution

  • we set up an acceptance criterion for each draw, based on comparing successive states with respect to a target distribution, which ensures that the stationary distribution is the posterior distribution we are searching for

  • there is no need to evaluate the potentially intractable marginal likelihood

  • after a sufficient number of iterations, the Markov chain of accepted draws converges to the stationary distribution; we can then use those samples as (correlated) draws from the posterior distribution and compute functions of the posterior

Metropolis-Hastings random walk algorithm

  • Start with an initial guess for $\theta$

  • Choose a new proposed value as $\theta_p = \theta + \delta_\theta,\ \delta_\theta \sim N(0, \sigma).$

    Here we have chosen the proposal distribution to be $N(0, \sigma).$

  • If $g$ is the posterior density, calculate the ratio $\rho = \frac{g(\theta_p \mid X)}{g(\theta \mid X)}$

  • (if the proposal distribution is not symmetric, adjust the ratio accordingly)

  • If $\rho \ge 1,$ accept $\theta = \theta_p;$ if $\rho < 1,$ accept $\theta = \theta_p$ with probability $\rho,$ otherwise keep the current $\theta.$ (This step is done with the help of the standard Uniform distribution.)
In [109]:
##############################################
# Metropolis-Hastings
##############################################
In [110]:
function target(likelihood, prior, n, h, theta)
    if (theta < 0 ||  theta > 1)
        return 0
    else
        return (pdf(likelihood(n, theta), h) * pdf(prior, theta))
    end
end
Out[110]:
target (generic function with 1 method)
In [111]:
n = 100
h = 61
a = 10
b = 10
likelihood = Binomial
prior = Beta(a, b)
sigma = 0.3
Out[111]:
0.3
In [112]:
naccept = 0
theta = 0.1
niters = 10000

samples = zeros(niters+1)
samples[1] = theta

for i=1:niters
    theta_p = theta + rand(Normal(0, sigma))
    rho = min(1, target(likelihood, prior, n, h, theta_p)/target(likelihood, prior, n, h, theta ))
    u = rand(Uniform(0,1))
    if u < rho
            naccept += 1
            theta = theta_p
    end
    samples[i+1] = theta
end
In [113]:
println("Portion of accepted steps = " * string(naccept/niters))
Portion of accepted steps = 0.2029
In [114]:
nmcmc = Int(round(length(samples)/2))

post = Beta(h+a, n-h+b)
thetas = range(0, stop=1, length=200)

histogram(samples[nmcmc:length(samples)] ,
          size = [500, 300],
    
          label="Distribution of posterior samples", alpha = 0.5,
          legend = :topleft)
histogram!(rand(prior, nmcmc), 
          label = "Distribution of prior samples", alpha = 0.5)
plot!(thetas, 50*[pdf(post, theta) for theta in thetas], color = :red, label = "True posterior")
Out[114]:
[histograms of posterior and prior samples, with the true posterior density overlaid]

We run the chain for $N$ iterations and discard the first $B$ samples. This is called burn-in.

We can run several parallel versions of the algorithm. Each of them is called a chain.

Neighbouring samples contain similar information, so we might want to keep only every second, fifth, or tenth sample. This is called thinning.
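A sketch of burn-in and thinning applied to a vector of draws (a stand-in vector is used so the snippet is self-contained; the burn-in length and thinning stride are illustrative):

```julia
samples = randn(10_001)          # stand-in for a sampler's output
B = 2_000                        # burn-in: discard the first B draws
kept = samples[(B + 1):end]
thinned = kept[1:5:end]          # thinning: keep every 5th draw
println(length(kept), " ", length(thinned))
```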

Convergence diagnostics

A rigorous way of assessing convergence is an unsolved problem, but there are several tools we can use to convince ourselves that an MCMC chain has converged, such as:

  • trace plots need to look stationary
  • parallel chains should carry similar information
In [115]:
function mh_coin(niters, n, h, theta, likelihood, prior, sigma)
    
    samples = [theta]
    while length(samples) < niters
        theta_p = theta + rand(Normal(0, sigma))
        rho = min(1, target(likelihood, prior, n, h, theta_p)/target(likelihood, prior, n, h, theta ))
        u = rand(Uniform(0,1))
        if u < rho
            theta = theta_p
        end
        append!(samples, theta)
    end
    
    return samples

end
Out[115]:
mh_coin (generic function with 1 method)
In [116]:
n = 100
h = 61
lik = Binomial
prior = Beta(a, b)
sigma = 0.05
niters = 100
Out[116]:
100
In [117]:
chains = [mh_coin(niters, n, h, theta, lik, prior, sigma) for theta in range(0.1, stop=1, step=0.2)];

Compare multiple chains

In [118]:
p = plot(chains[1], size= [500, 500], legend =:false, xlim = [0, niters], ylim = [0, 1])
for i in 2:length(chains)
    plot!(chains[i])
end
display(p)
[traceplots of the five chains]

Was it painful to write a sampler by hand?

If not, bear in mind that we only wrote the simplest one possible! Sampling algorithms can get very complicated.

Probabilistic programming languages (PPLs)

Luckily, we do not need to write a sampler by hand every time, because probabilistic programming languages (PPLs) are there to help.

A PPL allows us to formalize a Bayesian model and perform inference with the help of powerful algorithms. The user only needs to formulate the model (and maybe choose a sampler) and press the inference button.

The list of currently existing PPLs is overwhelmingly long and only keeps growing:

  • BUGS, WinBUGS, JAGS,
  • Stan,
  • PyMC3, PyMC4,
  • Nimble,
  • Pyro,
  • Edward, TensorFlow Probability, Edward 2,
  • Gen,
  • Turing,
  • Stheno,
  • SOSS,
  • Omega,
  • Infer.NET

to name a few.

There are already a couple of PPLs accessible from, or native to, Julia.

We will work with Turing.

In [119]:
##############################################
# Turing
##############################################
In [120]:
using Turing

Turing syntax

Model creation

In [121]:
@model mod(y) = begin
    # model definition
end
┌ Warning: Model definition seems empty, still continue.
└ @ Turing.Core /Users/kcft114/.julia/packages/Turing/m05p3/src/core/compiler.jl:494
Out[121]:
mod (generic function with 2 methods)

Coin tossing problem - Turing

In [122]:
n = 100    # number of trials
h = 61     # number of successes

niter = 10000
Out[122]:
10000
In [123]:
@model coin(n, h) = begin
    
    # prior
    p ~ Beta(2, 2)
    
    # likelihood
    h ~ Binomial(n, p)
    
end
Out[123]:
coin (generic function with 3 methods)

Sampling

In [124]:
ch = sample(coin(n,h), NUTS(niter, 0.65));
┌ Info: Found initial step size
│   init_ϵ = 0.4
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.997853682146408), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=2.0)
│   h.metric = DiagEuclideanMetric([0.0385448])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 10000 sampling steps in 0.585773358 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.0385448]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=2.0), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 13679.723818325752
│   mean(αs) = 0.539336737579122
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
In [125]:
show(ch)
Object of type Chains, with data of type 9000×11×1 Array{Union{Missing, Float64},3}

Log evidence      = 0.0
Iterations        = 1:9000
Thinning interval = 1
Chains            = 1
Samples per chain = 9000
internals         = eval_num, lp, acceptance_rate, hamiltonian_energy, is_accept, log_density, n_steps, numerical_error, step_size, tree_depth
parameters        = p

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean    │ std       │ naive_se    │ mcse       │ ess     │ r_hat   │
│     │ Symbol     │ Float64 │ Float64   │ Float64     │ Float64    │ Any     │ Any     │
├─────┼────────────┼─────────┼───────────┼─────────────┼────────────┼─────────┼─────────┤
│ 1   │ p          │ 0.60588 │ 0.0464697 │ 0.000489834 │ 0.00058591 │ 5552.61 │ 1.00008 │

Quantiles

│ Row │ parameters │ 2.5%     │ 25.0%    │ 50.0%    │ 75.0%    │ 97.5%    │
│     │ Symbol     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼────────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ p          │ 0.512391 │ 0.575263 │ 0.606619 │ 0.636669 │ 0.697461 │
In [126]:
# read samples into array
p = convert(Array{Float64}, ch[:p].value.data[:,:,1][:,1]);
In [127]:
histogram(p, size = [300, 300], legend = false, title = "posterior density")
Out[127]:
[histogram: posterior density]
In [128]:
# traceplot 
plot(p, size = [300, 300], legend = false, title = "traceplot")
Out[128]:
[plot: traceplot]
In [129]:
function plot_par(par)
    p1 = histogram(par, size = [400, 300], legend = false, title = "posterior density")
    p2 = plot(par, title = "traceplot")
    plot(p1, p2, layout=(1, 2), label="")
end
Out[129]:
plot_par (generic function with 1 method)
In [130]:
plot_par(p)
Out[130]:
[plots: posterior density, traceplot]

Hierarchical models

Hierarchical models put priors on the hyperparameters.

$$ y \sim f(y \mid \theta) $$$$ \theta \sim h(\theta \mid \sigma) $$

More levels of hierarchy are possible.

In this way, a hierarchical model allows information to be shared between the parameters $\theta$: they are not assumed to be independent, but rather drawn from a common distribution.

In [131]:
##############################################
# hierarchical models
##############################################
In [143]:
@model coin_hier(n, h) = begin
    
    # hyperparameters    
    alpha_hyp ~ InverseGamma(10, 2)
    beta_hyp ~ InverseGamma(10, 2)
    
    # prior
    p ~ Beta(alpha_hyp, beta_hyp)
    
    # likelihood
    h ~ Binomial(n, p)
    
end
Out[143]:
coin_hier (generic function with 3 methods)
In [150]:
niter = 20000
Out[150]:
20000
In [151]:
ch = sample(coin_hier(n,h), NUTS(niter, 0.30));
┌ Info: Found initial step size
│   init_ϵ = 0.8
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Warning: The current proposal will be rejected due to numerical error(s).
│   isfiniteθ = true
│   isfiniter = true
│   isfiniteℓπ = true
│   isfiniteℓκ = false
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/hamiltonian.jl:36
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.3, state.ϵ=2.6513610946627013), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=2.65)
│   h.metric = DiagEuclideanMetric([0.0906861, 0.0890539, 0.04 ...])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 20000 sampling steps in 0.658708255 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.0906861, 0.0890539, 0.04 ...]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=2.65), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 18482.23049922853
│   mean(αs) = 0.05218346942083084
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
In [152]:
show(ch)
Object of type Chains, with data of type 19000×13×1 Array{Union{Missing, Float64},3}

Log evidence      = 0.0
Iterations        = 1:19000
Thinning interval = 1
Chains            = 1
Samples per chain = 19000
internals         = eval_num, lp, acceptance_rate, hamiltonian_energy, is_accept, log_density, n_steps, numerical_error, step_size, tree_depth
parameters        = alpha_hyp, beta_hyp, p

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean     │ std       │ naive_se    │ mcse       │ ess     │ r_hat    │
│     │ Symbol     │ Float64  │ Float64   │ Float64     │ Float64    │ Any     │ Any      │
├─────┼────────────┼──────────┼───────────┼─────────────┼────────────┼─────────┼──────────┤
│ 1   │ alpha_hyp  │ 0.236544 │ 0.0841867 │ 0.000610755 │ 0.00310413 │ 678.542 │ 0.999955 │
│ 2   │ beta_hyp   │ 0.225032 │ 0.0783482 │ 0.000568398 │ 0.00312763 │ 487.225 │ 1.00011  │
│ 3   │ p          │ 0.604955 │ 0.0530292 │ 0.000384714 │ 0.00247522 │ 262.195 │ 1.00004  │

Quantiles

│ Row │ parameters │ 2.5%     │ 25.0%    │ 50.0%    │ 75.0%    │ 97.5%    │
│     │ Symbol     │ Float64  │ Float64  │ Float64  │ Float64  │ Float64  │
├─────┼────────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ alpha_hyp  │ 0.11868  │ 0.173982 │ 0.223655 │ 0.276504 │ 0.425165 │
│ 2   │ beta_hyp   │ 0.111732 │ 0.166139 │ 0.21462  │ 0.271694 │ 0.397669 │
│ 3   │ p          │ 0.48967  │ 0.574321 │ 0.606406 │ 0.645988 │ 0.696406 │
In [153]:
# read samples into array
p = convert(Array{Float64}, ch[:p].value.data[:,:,1][:,1]);
plot_par(p)
Out[153]:
[plots: posterior density, traceplot of p]
In [157]:
alpha_hyp = convert(Array{Float64}, ch[:alpha_hyp].value.data[:,:,1][:,1]);
plot_par(alpha_hyp)
Out[157]:
[plots: posterior density, traceplot of alpha_hyp]
In [160]:
beta_hyp = convert(Array{Float64}, ch[:beta_hyp].value.data[:,:,1][:,1]);
plot_par(beta_hyp)
Out[160]:
[plots: posterior density, traceplot of beta_hyp]

Estimating parameters of the normal distribution

$$y ∼ N(\mu, \sigma^2)$$

Data generation

In [ ]:
##############################################
# normal distribution
##############################################
In [224]:
N = 2000
y = rand(Normal(0,1), N)
histogram(y, size = [300, 300], legend = false)
Out[224]:
[histogram of y]

Turing Model: unknown mean, known variance

In [225]:
@model norm_mu(y) = begin
    
    sigma = 1
    
    # prior
    mu ~ Normal(0,0.5)
    
    # likelihood    
    for i in eachindex(y)
        y[i] ~  Normal(mu, sigma)
   end
    
end
Out[225]:
norm_mu (generic function with 2 methods)
In [226]:
ch = sample(norm_mu(y), NUTS(niter, 0.65));
┌ Info: Found initial step size
│   init_ϵ = 0.025
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.4493213954879545), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=1.45)
│   h.metric = DiagEuclideanMetric([0.000417026])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 10000 sampling steps in 1.670022066 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.000417026]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=1.45), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 6483.333347059405
│   mean(αs) = 0.8365008389526049
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
In [227]:
mu = ch[:mu].value.data[:,:,1]

plot_par(mu)
Out[227]:
[plots: posterior density, traceplot of mu]

Multiple chains

In [228]:
num_chains = 4
chains = mapreduce(c -> sample(norm_mu(y), NUTS(niter, 0.65)), chainscat, 1:num_chains)
┌ Info: Found initial step size
│   init_ϵ = 0.025
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.3751185160680217), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=1.38)
│   h.metric = DiagEuclideanMetric([0.00041332])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 10000 sampling steps in 1.754149703 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.00041332]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=1.38), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 12339.956720005377
│   mean(αs) = 0.8554619551130905
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
┌ Info: Found initial step size
│   init_ϵ = 0.025
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.441049484955168), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=1.44)
│   h.metric = DiagEuclideanMetric([0.000498292])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 10000 sampling steps in 1.455153829 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.000498292]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=1.44), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 12973.808820742
│   mean(αs) = 0.8070678849330645
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
┌ Info: Found initial step size
│   init_ϵ = 0.025
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.5273130022586943), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=1.53)
│   h.metric = DiagEuclideanMetric([0.00043099])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 10000 sampling steps in 1.610108754 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.00043099]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=1.53), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 6555.9308989341225
│   mean(αs) = 0.814720671247542
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
┌ Info: Found initial step size
│   init_ϵ = 0.025
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.40355160599994), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=1.4)
│   h.metric = DiagEuclideanMetric([0.000514812])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 10000 sampling steps in 1.498614675 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.000514812]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=1.4), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 7932.130209781657
│   mean(αs) = 0.8061749750402668
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
Out[228]:
Object of type Chains, with data of type 9000×11×4 Array{Union{Missing, Float64},3}

Iterations        = 1:9000
Thinning interval = 1
Chains            = 1, 2, 3, 4
Samples per chain = 9000
internals         = eval_num, lp, acceptance_rate, hamiltonian_energy, is_accept, log_density, n_steps, numerical_error, step_size, tree_depth
parameters        = mu

2-element Array{ChainDataFrame,1}

Summary Statistics
. Omitted printing of 2 columns
│ Row │ parameters │ mean       │ std       │ naive_se    │ mcse        │
│     │ Symbol     │ Float64    │ Float64   │ Float64     │ Float64     │
├─────┼────────────┼────────────┼───────────┼─────────────┼─────────────┤
│ 1   │ mu         │ -0.0048235 │ 0.0224061 │ 0.000118091 │ 0.000164522 │

Quantiles
. Omitted printing of 1 columns
│ Row │ parameters │ 2.5%      │ 25.0%     │ 50.0%       │ 75.0%     │
│     │ Symbol     │ Float64   │ Float64   │ Float64     │ Float64   │
├─────┼────────────┼───────────┼───────────┼─────────────┼───────────┤
│ 1   │ mu         │ -0.048737 │ -0.020058 │ -0.00475555 │ 0.0103441 │
In [229]:
mu = chains[:mu].value.data
plot(mu[:,:,1], label ="chain 1")
plot!(mu[:,:,2], label ="chain 2")
plot!(mu[:,:,3], label ="chain 3")
plot!(mu[:,:,4], label ="chain 4")
Out[229]:
[plot: traceplots of chains 1-4]
In [230]:
histogram(mu[:,:,1], alpha = 0.5, label = "chain 1")
histogram!(mu[:,:,2], alpha = 0.5, label = "chain 2")
histogram!(mu[:,:,3], alpha = 0.5, label = "chain 3")
histogram!(mu[:,:,4], alpha = 0.5, label = "chain 4")
Out[230]:
[plot: overlaid histograms of chains 1-4]

Turing Model: unknown mean, unknown variance

In [231]:
@model norm_mu_sigma(y) = begin
        
    # priors
    mu ~ Normal(0,0.5)
    sigma ~ InverseGamma(2, 3)
    
    # likelihood    
    for i in eachindex(y)
        y[i] ~  Normal(mu, sigma)
   end
    
end
Out[231]:
norm_mu_sigma (generic function with 2 methods)
In [232]:
ch = sample(norm_mu_sigma(y), NUTS(niter, 0.65));
┌ Info: Found initial step size
│   init_ϵ = 0.0125
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.2747417524413236), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=1.27)
│   h.metric = DiagEuclideanMetric([0.000470837, 0.000264778])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 10000 sampling steps in 2.972704221 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.000470837, 0.000264778]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=1.27), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 882.3506661582911
│   mean(αs) = 0.7930821554299076
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
In [233]:
show(ch)
Object of type Chains, with data of type 9000×12×1 Array{Union{Missing, Float64},3}

Log evidence      = 0.0
Iterations        = 1:9000
Thinning interval = 1
Chains            = 1
Samples per chain = 9000
internals         = eval_num, lp, acceptance_rate, hamiltonian_energy, is_accept, log_density, n_steps, numerical_error, step_size, tree_depth
parameters        = sigma, mu

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean        │ std       │ naive_se    │ mcse        │ ess     │ r_hat   │
│     │ Symbol     │ Float64     │ Float64   │ Float64     │ Float64     │ Any     │ Any     │
├─────┼────────────┼─────────────┼───────────┼─────────────┼─────────────┼─────────┼─────────┤
│ 1   │ mu         │ -0.00464717 │ 0.0227181 │ 0.00023947  │ 0.000252435 │ 8374.93 │ 0.99989 │
│ 2   │ sigma      │ 0.989695    │ 0.015715  │ 0.000165651 │ 0.000135044 │ 9000.0  │ 0.99989 │

Quantiles

│ Row │ parameters │ 2.5%      │ 25.0%      │ 50.0%       │ 75.0%     │ 97.5%     │
│     │ Symbol     │ Float64   │ Float64    │ Float64     │ Float64   │ Float64   │
├─────┼────────────┼───────────┼────────────┼─────────────┼───────────┼───────────┤
│ 1   │ mu         │ -0.048276 │ -0.0201671 │ -0.00480952 │ 0.0105537 │ 0.0399445 │
│ 2   │ sigma      │ 0.959298  │ 0.979116   │ 0.989499    │ 1.00024   │ 1.0211    │
In [234]:
mu = ch[:mu].value.data[:,:,1]
sigma = ch[:sigma].value.data[:,:,1]
pl_mu = plot_par(mu)
vline!([0], color = :green, label="MLE")
pl_sigma = plot_par(sigma)
vline!([1], color = :green, label="MLE")
display(pl_mu)
display(pl_sigma)
[plots: posterior density, traceplot of mu with MLE line]
[plots: posterior density, traceplot of sigma with MLE line]

Linear regression

Likelihood:

$$ y \sim N(ax+b, \sigma^2)$$

Priors:

$$ a \sim N(0,100) $$$$ b \sim N(0,100) $$$$ \sigma \sim U(0,20) $$

Data generation

In [ ]:
##############################################
# linear regression
##############################################
In [235]:
n = 100
a_true = 6
b_true = 2
x = range(0, stop=1, length = n)
x = convert(Array, x)
y = a_true*x .+ b_true + rand(Normal(0,1), n);
In [236]:
plot(x, a_true*x .+ b_true, legend = false, size = [350, 350], color = :blue)
scatter!(x, y)
Out[236]:
[plot: scattered data with true regression line]
In [237]:
@model lin_reg(x, y) = begin
  
  a ~ Normal(0, 10)
  b ~ Normal(0, 10)
  lp = a * x .+ b
    
  s ~ InverseGamma(2, 3)
    
  for i in eachindex(y)
    y[i] ~ Normal(lp[i], sqrt(s))
  end
end
Out[237]:
lin_reg (generic function with 3 methods)
In [238]:
niter = 20000
Out[238]:
20000
In [239]:
ch = sample(lin_reg(x, y), NUTS(niter, 0.65));
┌ Info: Found initial step size
│   init_ϵ = 0.2
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=0.4527591848079716), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=0.453)
│   h.metric = DiagEuclideanMetric([0.118204, 0.0407925, 0.018 ...])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 20000 sampling steps in 3.594597784 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.118204, 0.0407925, 0.018 ...]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=0.453), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 5710.433307092268
│   mean(αs) = 0.8579825157539616
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
In [240]:
a = ch[:a].value.data[:,:,1]
b = ch[:b].value.data[:,:,1]
s = ch[:s].value.data[:,:,1]
pl_a = plot_par(a)
vline!([a_true])
pl_b = plot_par(b)
vline!([b_true])
pl_s = plot_par(s)
vline!([1])
display(pl_a); 
display(pl_b)
display(pl_s)
[plots: posterior density, traceplot of a with true value]
[plots: posterior density, traceplot of b with true value]
[plots: posterior density, traceplot of s with true value]

Binomial likelihood

In [ ]:
##############################################
# binomial likelihood
##############################################
In [241]:
function invlogit(x)
  exp.(x) ./ (1 .+  exp.(x))
end
Out[241]:
invlogit (generic function with 1 method)
In [242]:
n = 1000
x = rand(Normal(), n)
alpha_true = -0.3
beta_true = 0.7
ps = alpha_true .+ beta_true*x 
ps = invlogit(ps)
y = [rand(Binomial(10, p)) for p in ps];
histogram(ps, bins = 20, title = "p", label = "", size = [300, 300])
Out[242]:
[histogram of p]
In [243]:
@model binom(x, y) = begin
  
  alpha ~ Normal(0, 1)
  beta ~ Normal(0, 1)
  
  p = invlogit(alpha .+ beta*x)

  for i in eachindex(y)
    y[i] ~ Binomial(10, p[i])
  end
end
Out[243]:
binom (generic function with 3 methods)
In [245]:
ch = sample(binom(x, y), NUTS(niter, 0.65));
┌ Info: Found initial step size
│   init_ϵ = 0.025
└ @ Turing.Inference /Users/kcft114/.julia/packages/Turing/m05p3/src/inference/hmc.jl:365
┌ Info: Finished 1000 adapation steps
│   adaptor = StanHMCAdaptor(n_adapts=1000, pc=DiagPreconditioner, ssa=NesterovDualAveraging(γ=0.05, t_0=10.0, κ=0.75, δ=0.65, state.ϵ=1.079233753616307), init_buffer=75, term_buffer=50)
│   τ.integrator = Leapfrog(ϵ=1.08)
│   h.metric = DiagEuclideanMetric([0.0004598, 0.000638437])
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:67
┌ Info: Finished 20000 sampling steps in 21.369849906 (s)
│   h = Hamiltonian(metric=DiagEuclideanMetric([0.0004598, 0.000638437]))
│   τ = NUTS{Multinomial}(integrator=Leapfrog(ϵ=1.08), max_depth=5), Δ_max=1000.0)
│   EBFMI(Hs) = 15333.80231513245
│   mean(αs) = 0.8513041124541438
└ @ AdvancedHMC /Users/kcft114/.julia/packages/AdvancedHMC/YWXfk/src/sampler.jl:77
In [246]:
show(ch)
Object of type Chains, with data of type 19000×12×1 Array{Union{Missing, Float64},3}

Log evidence      = 0.0
Iterations        = 1:19000
Thinning interval = 1
Chains            = 1
Samples per chain = 19000
internals         = eval_num, lp, acceptance_rate, hamiltonian_energy, is_accept, log_density, n_steps, numerical_error, step_size, tree_depth
parameters        = alpha, beta

2-element Array{ChainDataFrame,1}

Summary Statistics

│ Row │ parameters │ mean      │ std       │ naive_se    │ mcse        │ ess     │ r_hat   │
│     │ Symbol     │ Float64   │ Float64   │ Float64     │ Float64     │ Any     │ Any     │
├─────┼────────────┼───────────┼───────────┼─────────────┼─────────────┼─────────┼─────────┤
│ 1   │ alpha      │ -0.287012 │ 0.0211922 │ 0.000153744 │ 0.000141211 │ 19000.0 │ 1.00007 │
│ 2   │ beta       │ 0.710331  │ 0.0239084 │ 0.00017345  │ 0.000159387 │ 18697.2 │ 1.00008 │

Quantiles

│ Row │ parameters │ 2.5%     │ 25.0%     │ 50.0%     │ 75.0%     │ 97.5%     │
│     │ Symbol     │ Float64  │ Float64   │ Float64   │ Float64   │ Float64   │
├─────┼────────────┼──────────┼───────────┼───────────┼───────────┼───────────┤
│ 1   │ alpha      │ -0.32898 │ -0.301143 │ -0.287089 │ -0.272646 │ -0.245106 │
│ 2   │ beta       │ 0.663797 │ 0.694189  │ 0.710112  │ 0.726482  │ 0.758001  │
In [247]:
alpha = ch[:alpha].value.data[:,:,1]
beta = ch[:beta].value.data[:,:,1]
pl_a = plot_par(alpha)
pl_b = plot_par(beta)
display(pl_a); 
display(pl_b);
[plots: posterior density, traceplot of alpha]
[plots: posterior density, traceplot of beta]

OUTRO

Bayesian Neural Networks

  • Neural networks (the workhorse of deep learning) are compositions of functions applied to matrices
  • Can we supply predictions of neural networks with uncertainty?
  • Mind the gap: interpretability of priors
(Graphics by Eric Ma)
  • Bayesian deep learning: learning a probability distribution over each parameter, rather than a point estimate
(Graphics by Eric Ma)

NeurIPS 2019, Bayesian Deep Learning:

videos of talks

https://slideslive.com/s/yarin-gal-22751

"Gaussian Process Behaviour in Wide Deep Neural Networks"

Alexander Matthews

R. Neal (1994):

Consider a single-hidden-layer network (MLP) with $D$ input features and $K$ hidden units, where the priors are appropriately scaled, i.e.

  • normal with mean 0 and variance $\frac{1}{D}$ for the "input -> Layer 1" weights
  • normal with mean 0 and variance $\frac{1}{K}$ for the "Layer 1 -> output" weights

As $K$ tends to infinity, the network output converges in distribution to a multivariate normal $N(0,C)$ with covariance matrix $C.$

(Proof: multivariate version of the Central Limit Theorem)

"In distribution" is the key!

(Alexander Matthews, NeurIPS 2019)
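Neal's result can be probed empirically in a few lines of Julia (a sketch, not from the talk; the `tanh` activation and the sizes chosen here are my own assumptions): with the scaled priors above, outputs of many independently drawn wide networks at a fixed input behave like draws from a zero-mean normal with stable variance.

```julia
using LinearAlgebra, Random, Statistics

# Empirical check of Neal's scaling: a single-hidden-layer MLP with
# D inputs and K hidden units, weights drawn from the scaled priors.
Random.seed!(1)
D, K, nsamples = 5, 5_000, 2_000
x = randn(D)                        # one fixed input

outputs = map(1:nsamples) do _
    W1 = randn(K, D) ./ sqrt(D)     # "input -> Layer 1" weights, variance 1/D
    w2 = randn(K) ./ sqrt(K)        # "Layer 1 -> output" weights, variance 1/K
    dot(w2, tanh.(W1 * x))          # scalar network output
end

println(mean(outputs))              # close to 0
println(std(outputs))               # finite and stable as K grows
```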

Matthews et al (2018),

generalisation of the result for multiple hidden layers:

  • scaling priors correspondingly (normals with variance $\frac{1}{D}$, $\frac{1}{K_1}$, $\frac{1}{K_2}$ and so on)

As $K_1, K_2, \ldots, K_i$ all tend to infinity, the output of layer $i$ is distributed as $N(0, C^{(i)}),$

where $C^{(i)}$ is recursively defined from $C^{(i-1)}.$

(Alexander Matthews, NeurIPS 2019)
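A Monte Carlo sketch of how such a recursion $C^{(i-1)} \to C^{(i)}$ can be computed (nothing here comes from the talk; the `tanh` nonlinearity, the input covariance, and the sample size are all illustrative assumptions): draw pre-activation pairs for two inputs from $N(0, C^{(i-1)})$, apply the nonlinearity, and average the products.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical Monte Carlo version of the kernel recursion.
Random.seed!(2)
φ = tanh
function next_C(C; nmc = 200_000)
    L = cholesky(Symmetric(C)).L
    z = L * randn(2, nmc)           # columns are pairs ~ N(0, C)
    a = φ.(z)                       # apply the nonlinearity
    a * a' ./ nmc                   # empirical E[φ(f) φ(f)']
end

C0 = [1.0 0.5; 0.5 1.0]             # input-layer covariance of two points
C1 = next_C(C0)                     # covariance after one hidden layer
C2 = next_C(C1)                     # ... and after two
```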

Bayesian Decision Theory

Decision Theory

Given what we currently know about the world, how should we decide what to do, taking into account future events and observations that may change our conclusions?

Decision theory offers a way to automatically design and run experiments and to optimally construct clinical trials.

The exploration-exploitation trade-off

As we gather more information, we learn more about the world. However, the things we learn about depend on what actions we take.

For example, if we always take the same route to work, then we learn how much time this route takes on different days and at different times of the week. However, we do not obtain information about how long other routes take, so we potentially miss out on better choices than the one we usually follow. This phenomenon gives rise to the so-called exploration-exploitation trade-off.

The exploration-exploitation trade-off arises when data collection is interactive.
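The commuting example can be sketched as a tiny bandit problem (a hypothetical illustration; none of the numbers come from the slides): an ε-greedy commuter mostly exploits the route that currently looks fastest, but still explores occasionally.

```julia
using Random, Statistics

# Two "routes" with unknown mean travel times; the commuter learns them online.
Random.seed!(3)
true_means = [30.0, 25.0]   # route 2 is faster, but the commuter doesn't know
est = [0.0, 0.0]            # running mean travel-time estimate per route
cnt = [0, 0]                # number of trips per route
ε = 0.1                     # exploration probability

for t in 1:5_000
    # try each route once, then explore with prob. ε and exploit otherwise
    a = any(cnt .== 0) ? findfirst(cnt .== 0) :
        (rand() < ε ? rand(1:2) : argmin(est))
    trip = true_means[a] + randn()         # noisy observed travel time
    cnt[a] += 1
    est[a] += (trip - est[a]) / cnt[a]     # incremental mean update
end

println(cnt)    # most trips end up on the faster route 2
println(est)    # estimates close to the true means
```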

Bayesian Decision Theory

  • $\theta$ - parameter

  • $p(\theta)$ - prior on $\theta$

  • $A$ - all possible actions one can take

  • $l: A \times \Theta \to \mathbb{R}$ - loss function

The posterior risk of an action $a \in A$ is the posterior expectation of the loss:

$$ r(a) = E_\theta [ l(a, \theta) \mid D] = \int l(a, \theta)\, p(\theta \mid D)\, d\theta. $$

Loss functions:

  1. $l (\theta, \hat{\theta}) = (\theta - \hat{\theta})^2$ (squared error). To find the optimal action, we minimize $$r(\hat{\theta}) = E [(\theta - \hat{\theta})^2 \mid D].$$ Taking the derivative of the integral with respect to $\hat{\theta}$ and setting it to zero gives $\hat{\theta} = E [\theta \mid D]$ - the posterior mean.
  2. $l (\theta, \hat{\theta}) = \mid \theta - \hat{\theta} \mid$ (absolute error) -> median of the posterior

  3. $l (\theta, \hat{\theta}) = I_{\theta \ne \hat{\theta}}$ (0-1 loss) -> mode of the posterior
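With posterior draws in hand, each of these estimators is a one-liner. A sketch (the draws below are a synthetic stand-in for posterior samples, not output of the models above):

```julia
using Random, Statistics

Random.seed!(4)
θ = 0.6 .+ 0.05 .* randn(10_000)     # stand-in for posterior samples

θ_sq  = mean(θ)      # squared-error loss  -> posterior mean
θ_abs = median(θ)    # absolute-error loss -> posterior median
# 0-1 loss -> posterior mode; for samples, approximate with a histogram
edges  = 0.0:0.01:1.0
counts = [count(e .<= θ .< e + 0.01) for e in edges[1:end-1]]
θ_mode = edges[argmax(counts)] + 0.005
```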

Utility

Probability can be used to describe how likely an event is; utility can be used to describe how desirable it is.

Say again, frequentist or Bayesian?

Frequentist

  • There are many ways to come up with a point estimator: the method of moments, etc. Our responsibility is to justify the choice of the estimator.

Bayesian

  • There is no choice of the procedure of obtaining the estimate. Our choices are the prior, probability model, and (maybe) the loss function. Everything after that is computation!

How to choose a PPL?

  • functionality (e.g. availability of discrete parameters)
  • open to custom distributions (and samplers)
  • performance
  • well documented
  • abundant library of examples
  • active (and supportive) community

An application from Pharma

an example of Julia + Turing being used for a publication

https://pubs.acs.org/doi/pdf/10.1021/acs.chemrestox.9b00264

Take home messages

1. Advantages of Bayesian inference

  • allows us to use domain knowledge about the research question
  • incorporates various sources of knowledge
  • can answer a broad variety of questions by computing the desired quantities from posterior samples
  • can work with small data
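"Computing desired quantities from posterior samples" really is this direct: any event probability or interval is just a summary of the draws. A sketch (here `p` is a synthetic stand-in for the coin-model samples extracted earlier):

```julia
using Random, Statistics

Random.seed!(5)
p = 0.6 .+ 0.05 .* randn(10_000)        # stand-in for posterior draws of p

prob_biased = mean(p .> 0.5)            # P(p > 0.5 | data)
ci = quantile(p, [0.025, 0.975])        # 95% credible interval for p
println(prob_biased)
println(ci)
```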

2. Disadvantages of Bayesian inference

  • computationally involved
  • even running a model in a PPL can take a lot of time
  • for MCMC methods, convergence is achieved, in theory, only in the limit of infinitely many iterations

3. PPLs

  • modern approach to inference
  • provide the inference "button"
  • only the model needs to be formulated
  • each PPL has its own syntax

Literature and resources

learn Julia:

Bayesian Inference:

In [6]:
using Pkg
Pkg.status()
    Status `~/.julia/environments/v1.0/Project.toml`
  [0bf59076] AdvancedHMC v0.2.2
  [c52e3926] Atom v0.8.8
  [76274a88] Bijectors v0.3.1
  [336ed68f] CSV v0.5.9
  [159f3aea] Cairo v0.6.0
  [324d7699] CategoricalArrays v0.5.5
  [3a865a2d] CuArrays v1.1.0
  [a93c6f00] DataFrames v0.19.1
  [31c24e10] Distributions v0.21.1
  [ced4e74d] DistributionsAD v0.1.0
  [587475ba] Flux v0.8.3
  [da1fdf0e] FreqTables v0.3.1
  [38e38edf] GLM v1.3.1
  [28b8d3ca] GR v0.41.0
  [c91e804a] Gadfly v1.0.1
  [93e5fe13] Hyperopt v0.2.4
  [7073ff75] IJulia v1.19.0
  [18b7da76] JuliaAcademyData v0.1.0 #master (https://github.com/JuliaComputing/JuliaAcademyData.jl)
  [e5e0dc1b] Juno v0.7.0
  [5ab0869b] KernelDensity v0.5.1
  [b964fa9f] LaTeXStrings v1.0.3
  [c7f686f2] MCMCChains v0.3.11
  [f0e99cf1] MLBase v0.8.0
  [cc2ba9b6] MLDataUtils v0.5.0
  [270893ea] MLPreprocessing v0.0.0 #master (https://github.com/JuliaML/MLPreprocessing.jl)
  [ee78f7c6] Makie v0.9.4
  [442fdcdd] Measures v0.3.0
  [47be7bcc] ORCA v0.2.1
  [429524aa] Optim v0.19.2
  [3b7a836e] PGFPlots v3.1.3
  [58dd65bb] Plotly v0.2.0
  [f0f68f2c] PlotlyJS v0.12.5
  [91a5bcdd] Plots v0.25.3
  [92933f4c] ProgressMeter v1.0.0
  [438e738f] PyCall v1.91.2
  [d330b81b] PyPlot v2.8.1
  [ce6b1742] RDatasets v0.6.3
  [682df890] Stan v3.5.0
  [2913bbd2] StatsBase v0.32.0
  [4c63d2b9] StatsFuns v0.8.0
  [3eaba693] StatsModels v0.6.2
  [f3b207a7] StatsPlots v0.11.0
  [9f7883ad] Tracker v0.2.2
  [fce5fe82] Turing v0.6.23 #master (https://github.com/TuringLang/Turing.jl.git)
  [e88e6eb3] Zygote v0.4.1
In [ ]: